First Steps with Pandas



In [38]:

    
import pandas as pd

Importing Data



In [39]:

    
df = pd.read_csv('scraped_and_cleand_six.csv')

Looking at summary of the data



In [40]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1169 entries, 0 to 1168
Data columns (total 7 columns):
Unnamed: 0     1169 non-null int64
Company        1169 non-null object
Date           1169 non-null object
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
Price_m        1169 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 64.0+ KB



In [11]:

    
df['Share Total'].apply(comma)









    Out[11]:





0         14323
1         12032
2           500
3           500
4            15
5             6
6          2000
7       2190000
8        160000
9          3118
10        82000
11          787
12        13543
13         3290
14         5000
15       950000
16        40000
17          400
18      1460000
19          743
20           13
21         4000
22         2000
23         3418
24         1200
25         5000
26          100
27          868
28       622016
29          190
         ...   
1149          4
1150     150000
1151        316
1152        134
1153        131
1154        131
1155       5410
1156       9830
1157        513
1158          9
1159        105
1160         41
1161        153
1162       5000
1163       1214
1164        267
1165          4
1166         30
1167       1000
1168        292
1169          5
1170         57
1171      23703
1172         11
1173         57
1174       1800
1175         57
1176      30000
1177         50
1178         57
Name: Share Total, Length: 1179, dtype: object



In [12]:

    
df['Share Total'] = df['Share Total'].apply(comma)



In [13]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 5 columns):
Company        1179 non-null object
Date           1179 non-null object
Price          1169 non-null object
Share Total    1179 non-null object
Type           1179 non-null object
dtypes: object(5)
memory usage: 46.1+ KB



In [14]:

    
df['Share Total'] = df['Share Total'].astype(int)



In [15]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 5 columns):
Company        1179 non-null object
Date           1179 non-null object
Price          1169 non-null object
Share Total    1179 non-null int64
Type           1179 non-null object
dtypes: int64(1), object(4)
memory usage: 46.1+ KB

Lets sort the values



In [16]:

    
df['Share Total'].sort_values(ascending=False).head(20)









    Out[16]:





1017    27205398
364      4859499
610      3496500
538      3000000
7        2190000
1082     2190000
1083     2090000
120      1800000
885      1751848
312      1531095
1085     1460000
18       1460000
1084     1340000
545      1340000
317      1290398
784      1076516
256      1000000
306      1000000
15        950000
324       940000
Name: Share Total, dtype: int64



In [17]:

    
df[df['Share Total']==27205398]









    Out[17]:







  
    
      
      Company
      Date
      Price
      Share Total
      Type
    
  
  
    
      1017
      KTM Industries AG
      23.02.2017
      133'306'450.20
      27205398
      Sale

Or, if we want to see the whole sorted table



In [18]:

    
df.sort_values(by='Company', ascending=True).head()









    Out[18]:







  
    
      
      Company
      Date
      Price
      Share Total
      Type
    
  
  
    
      0
      ABB Ltd
      30.10.2017
      362'229.00
      14323
      Purchase
    
    
      1
      ABB Ltd
      30.10.2017
      304'289.00
      12032
      Purchase
    
    
      2
      ABB Ltd
      30.10.2017
      10'060.00
      500
      Purchase
    
    
      3
      ABB Ltd
      30.10.2017
      10'060.00
      500
      Purchase
    
    
      931
      ABB Ltd
      15.03.2017
      15'549.00
      679
      Sale

Jetzt zu den Preisen. Wir wenden wieder unsere Function an, aber es gibt einen Fehler. Warum?



In [19]:

    
df['Price'].apply(comma)









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-f63b360c6463> in <module>()
----> 1 df['Price'].apply(comma)

~/.virtualenvs/master/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66645)()

<ipython-input-10-96f46206cf69> in comma(elem)
      1 def comma(elem):
----> 2     elem = elem.replace("'", '')
      3     return elem

AttributeError: 'float' object has no attribute 'replace'



In [20]:

    
def comma(elem):
    try:
        elem = elem.replace("'", '').split('.')[0]
        return int(elem)
    except:
        return elem



In [21]:

    
test = "26767.89"



In [22]:

    
int(test.split(".")[0])









    Out[22]:





26767



In [23]:

    
df['Price'] = df['Price'].apply(comma)

Another way of dealing with NaN Values -> Drop them



In [24]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 5 columns):
Company        1179 non-null object
Date           1179 non-null object
Price          1169 non-null float64
Share Total    1179 non-null int64
Type           1179 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 46.1+ KB



In [25]:

    
df = df.dropna()



In [26]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1169 entries, 0 to 1178
Data columns (total 5 columns):
Company        1169 non-null object
Date           1169 non-null object
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 54.8+ KB

Documentation



In [27]:

    
df['Price'] = df['Price'].apply(comma)

Price in Millions



In [28]:

    
df['Price_m'] = round(df['Price'] / 1000000, 1)



In [29]:

    
df.sort_values(by='Price_m', ascending=False).head()









    Out[29]:







  
    
      
      Company
      Date
      Price
      Share Total
      Type
      Price_m
    
  
  
    
      1017
      KTM Industries AG
      23.02.2017
      133306450.0
      27205398
      Sale
      133.3
    
    
      885
      Bank Coop AG
      22.03.2017
      77992272.0
      1751848
      Purchase
      78.0
    
    
      364
      Idorsia Ltd
      19.06.2017
      69765826.0
      4859499
      Purchase
      69.8
    
    
      68
      Feintool International Holding AG
      29.09.2017
      64732500.0
      616500
      Sale
      64.7
    
    
      312
      Idorsia Ltd
      07.07.2017
      30621900.0
      1531095
      Purchase
      30.6

Who sold the most?

Lets just consider the sales



In [30]:

    
df['Type'].value_counts()









    Out[30]:





Sale                  694
Purchase              473
Granting / Writing      2
Name: Type, dtype: int64



In [31]:

    
dfs = df[df['Type']=='Sale']



In [32]:

    
dfs.groupby('Company')['Price_m'].sum().sort_values(ascending=False).head(10)









    Out[32]:





Company
KTM Industries AG                        139.8
Temenos Group AG                          67.8
Feintool International Holding AG         65.4
Chocoladefabriken Lindt & Sprüngli AG     37.9
Molecular Partners Ltd                    28.0
Cosmo Pharmaceuticals N.V.                26.6
Partners Group Holding AG                 16.4
Vontobel Holding AG                       15.9
ams AG                                    15.4
Compagnie Financière Richemont SA         15.0
Name: Price_m, dtype: float64

And the buys?



In [33]:

    
dfp = df[df['Type']=='Purchase']



In [34]:

    
dfp.groupby('Company')['Price_m'].sum().sort_values(ascending=False).head(10)









    Out[34]:





Company
Idorsia Ltd                            147.0
Bank Coop AG                            78.0
Ypsomed Holding AG                      21.2
Compagnie Financière Tradition S.A.     19.8
Temenos Group AG                        14.9
Logitech International S.A.             10.8
ams AG                                  10.0
Arbonia AG                               9.6
Edmond de Rothschild (Suisse) S.A.       7.8
Nestlé AG                                7.3
Name: Price_m, dtype: float64



In [36]:

    
df.to_csv('scraped_and_cleand_six.csv')

Now lets look at the dates, this can get tricky



In [221]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1169 entries, 0 to 1178
Data columns (total 6 columns):
Company        1169 non-null object
Date           1169 non-null object
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
Price_m        1169 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 63.9+ KB



In [222]:

    
df['Date'].head()









    Out[222]:





0    30.10.2017
1    30.10.2017
2    30.10.2017
3    30.10.2017
4    27.10.2017
Name: Date, dtype: object



In [223]:

    
from datetime import datetime



In [224]:

    
pd.to_datetime(df['Date'], format='%d.%m.%Y')









    Out[224]:





0      2017-10-30
1      2017-10-30
2      2017-10-30
3      2017-10-30
4      2017-10-27
5      2017-10-27
6      2017-10-27
7      2017-10-27
8      2017-10-27
9      2017-10-27
10     2017-10-26
11     2017-10-26
12     2017-10-26
13     2017-10-25
14     2017-10-25
15     2017-10-25
16     2017-10-25
17     2017-10-25
18     2017-10-24
19     2017-10-24
20     2017-10-23
21     2017-10-20
22     2017-10-20
23     2017-10-18
24     2017-10-17
25     2017-10-17
26     2017-10-16
27     2017-10-16
28     2017-10-16
29     2017-10-13
          ...    
1149   2017-01-12
1150   2017-01-12
1151   2017-01-12
1152   2017-01-12
1153   2017-01-12
1154   2017-01-12
1155   2017-01-12
1156   2017-01-11
1157   2017-01-11
1158   2017-01-11
1159   2017-01-11
1160   2017-01-10
1161   2017-01-10
1162   2017-01-10
1163   2017-01-10
1164   2017-01-09
1165   2017-01-09
1166   2017-01-09
1167   2017-01-09
1168   2017-01-06
1169   2017-01-06
1170   2017-01-06
1171   2017-01-05
1172   2017-01-05
1173   2017-01-05
1174   2017-01-05
1175   2017-01-04
1176   2017-01-03
1177   2017-01-03
1178   2017-01-03
Name: Date, Length: 1169, dtype: datetime64[ns]



In [225]:

    
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')



In [226]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 1169 entries, 0 to 1178
Data columns (total 6 columns):
Company        1169 non-null object
Date           1169 non-null datetime64[ns]
Price          1169 non-null float64
Share Total    1169 non-null int64
Type           1169 non-null object
Price_m        1169 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 103.9+ KB

Let's plot the transaction counts

Code overview

First we need to make the date the index



In [230]:

    
df.index = df['Date']



In [232]:

    
df.resample('W')['Date'].count().head()









    Out[232]:





Date
2017-01-08    11
2017-01-15    25
2017-01-22    21
2017-01-29    18
2017-02-05    14
Freq: W-SUN, Name: Date, dtype: int64

Lets plot that



In [233]:

    
import matplotlib.pyplot as plt
import matplotlib
plt.style.use('ggplot')
%matplotlib inline



In [234]:

    
df.resample('W')['Date'].count().plot()









    Out[234]:





<matplotlib.axes._subplots.AxesSubplot at 0x10c8eef60>



In [235]:

    
df.resample('W')['Price'].sum().plot()









    Out[235]:





<matplotlib.axes._subplots.AxesSubplot at 0x10ca30cf8>

Shall we print this out?



In [237]:

    
df.resample('W')['Price'].sum().plot()
plt.savefig('hello.pdf')

But what about different shapes?



In [241]:

    
df.resample('M')['Price'].sum().plot(kind='bar')









    Out[241]:





<matplotlib.axes._subplots.AxesSubplot at 0x10cc51390>



In [249]:

    
df.resample('M')['Price'].sum().plot(kind='barh')









    Out[249]:





<matplotlib.axes._subplots.AxesSubplot at 0x10db4b908>



In [252]:

    
df.resample('M')['Price'].sum().plot(kind='pie')









    Out[252]:





<matplotlib.axes._subplots.AxesSubplot at 0x10d905128>



In [255]:

    
df.resample('M')['Price'].sum().plot(kind='pie', radius=0.5, shadow=True)









    Out[255]:





<matplotlib.axes._subplots.AxesSubplot at 0x10dd9fa20>



In [256]:

    
df.resample('Q')['Price'].sum().plot(kind='pie', radius=0.5, shadow=True)









    Out[256]:





<matplotlib.axes._subplots.AxesSubplot at 0x10df25898>



In [268]:

    
labels = 'Q1', 'Q2', 'Q3', 'Q4'
colors = ['yellowgreen', 'grey', 'grey', 'grey']
explode = (0.05, 0.05, 0.05, 0.05)
plt.axis('equal')
df.resample('Q')['Price'].sum().plot(kind='pie', radius=0.5, autopct='%1.1f%%', shadow=False, labels=labels,colors=colors, explode=explode)









    Out[268]:





<matplotlib.axes._subplots.AxesSubplot at 0x10e7b8da0>



In [ ]:

	Company	Date	Price	Share Total	Type
0	ABB Ltd	30.10.2017	362'229.00	14323	Purchase
1	ABB Ltd	30.10.2017	304'289.00	12032	Purchase
2	ABB Ltd	30.10.2017	10'060.00	500	Purchase
3	ABB Ltd	30.10.2017	10'060.00	500	Purchase
931	ABB Ltd	15.03.2017	15'549.00	679	Sale

	Company	Date	Price	Share Total	Type	Price_m
1017	KTM Industries AG	23.02.2017	133306450.0	27205398	Sale	133.3
885	Bank Coop AG	22.03.2017	77992272.0	1751848	Purchase	78.0
364	Idorsia Ltd	19.06.2017	69765826.0	4859499	Purchase	69.8
68	Feintool International Holding AG	29.09.2017	64732500.0	616500	Sale	64.7
312	Idorsia Ltd	07.07.2017	30621900.0	1531095	Purchase	30.6